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Abstract— In this paper, we introduce the edge-valued binary 
decision diagram (EYBDD) to reduce the memory and delay in 
numerical function generators (NFGs). An NFG realizes a func¬ 
tion, such as a trigonometric, logarithmic, square root, or recip¬ 
rocal function, in hardware. NFGs are important in, for exam¬ 
ple, digital signal applications, where high speed and accuracy are 
necessary. We use the EVBDD to produce a fast and compact seg¬ 
ment index encoder (SIE) that is a key component in our NFG. We 
compare our approach with NFG designs based on multi-terminal 
BDD’s (MTBDDs), and show that the EVBDD produces SIEs that 
have, on average, only 7% of the memory and 40% of the delay 
of those designed using MTBDDs. Therefore, our NFGs based on 
EVBDDs have, on average, only 38% of the memory and 59% of 
the delay of NFGs based on MTBDDs. 

I. Introduction 

There has been significant interest recently in the realiza¬ 
tion of numeric functions, like sin(jtx), ln(x), \/x, and -fx, by 
high speed logic circuits. This is, in part, due to the availabil¬ 
ity of large quantities of inexpensive, programmable logic in 
FPGA’s, and, in part, to the development of realization meth¬ 
ods based on polynomial approximations [3, 5, 6, 8,16,22, 23], 
Until the last few years, the dominant approach has been to 
partition the domain into uniform segments. Within each seg¬ 
ment a linear [23] or higher order [3, 5, 6, 8, 16, 22] ap¬ 
proximation is used to represent the function. It has been 
shown [7] that linear approximation is well suited for certain 
‘simple’ functions like 2 X , sin(ju), and cos(roc), but is inappro¬ 
priate for ‘complex’ functions like \Jx and the entropy func¬ 
tion, — (xlog^x) + (1 — x)log 2 (l — x)). For complex functions, 
optimum non-uniform segmentation produces tractable realiza¬ 
tions [2]. In this method, segments are chosen as wide as possi¬ 
ble, while still achieving the specified accuracy. Typically, nar¬ 
row segments are used where the function changes rapidly and 
wide segments are used in other regions. A segment index en¬ 
coder (SIE) is therefore needed to map the values in the domain 
to the segments. Within each segment the coefficients are the 
same for all points in the segment. As in uniform segmentation, 
a memory stores the coefficient values, which are then used to 
form the polynomial approximation. Potentially, the SIE de¬ 
signed for an optimum non-uniform segmentation is a com¬ 
plex circuit. To simplify the SIE, two approaches have been 
proposed. One is a segmentation approach [10, 11] that uses 
a special (non-optimum) non-uniform segmentation. Another 
one is a realization approach [20, 21 ] that uses an LUT cascade 
to realize the optimum non-uniform segmentation compactly. 

In this paper, we use both approaches to simplify the SIE. 
That is, this paper proposes a new segmentation approach and 
a new realization approach using an edge-valued binary deci¬ 
sion diagram (EVBDD). Our segmentation approach can also 


reduce the memory size and the delay time of an LUT cascade 
using a multi-terminal BDD (MTBDD). For both approaches, 
we establish a formal synthesis procedure that is easily pro¬ 
grammed. 


II. Preliminaries 

A. Number Representation and Precision 

Definition 1 A value X represented by the binary fixed-point 
representation is denoted by X = (x; i x/ 2 ■■■ X\ a'o- x i x 2 
... X- m ) 2 , where Xj £ {0,1}, / is the number of bits for the 
integer part, and m is the number of bits for the fractional part. 
This representation is two’s complement. 

Definition 2 Error is the absolute difference between the exact 
value and the value produced by the hardware. Approximation 
error is the error caused by a function approximation. Round¬ 
ing error is the error caused by a binary fixed-point representa¬ 
tion. Acceptable error is the maximum error that an NFG may 
assume. Acceptable approximation error is the maximum ap¬ 
proximation error that a function approximation may assume. 

Definition 3 Precision is the total number of bits for a binary 
fixed-point representation. Specially, n-bit precision specifies 
that n bits are used to represent the number; that is, n = 1 + m. 
We assume that an n-bit precision NFG has an n-bit input. 

Definition 4 Accuracy is the number of bits in the fractional 
part of a binaty fixed-point representation, m-bit accuracy 
specifies that m bits are used to represent the fractional part 
of the number. When the maximum error is 2~ m , the accuracy 
can be expressed as 1 unit in the last place (ULP). In this pa¬ 
per, an m-bit accuracy NFG is an NFG with an m-bit fractional 
part of the input, an m-bit fractional part of the output, and 1 
ULP ' 

B. Edge-Valued Binary Decision Diagram 

Definition 5 A binary decision diagram (BDD) [1] is a rooted 
directed acyclic graph representing a logic function: {0.1}" —> 
{0,1}. The BDD is obtained by repeatedly applying the Shan¬ 
non expansion to the logic function. Each function, including 
the original function and all sub-functions resulting from ap¬ 
plying the Shannon expansion, is represented by a non-terminal 
node, unless that function is a trivial function, 0 or 1, in which 
case, it is represented by a terminal node. Each non-terminal 
node has two outgoing edges, 0 -edge and 1 -edge, that corre¬ 
spond to the values of input variables. Both terminal nodes 
have no outgoing edges. 
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(a) Terminal case. (b) Non-terminal case. 


Fig. 1. Conversion of an MTBDD node into an EVBDD node. 
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(a) Function table. 




Fig. 2. MTBDD and EVBDD for an integer function. 


Definition 6 A multi-terminal BDD (MTBDD) [4] is an ex¬ 
tension of the BDD , and represents an integer function: 
{0,1}" —> Z, where Z is a set of integers. Specifically, it is a 
BDD in which the terminal nodes are not restricted to 0 and 
1. Rather, they are labeled by integer values. Alternatively, we 
can think of a BDD as a special case of an MTBDD, hi which 
there are only two terminal nodes, labeled 0 and 1. 

Definition 7 An edge-valued BDD (EVBDD) [9] is an ex¬ 
tension of the BDD, and represents an integer function. An 
EVBDD consists of one terminal node representing 0 and non¬ 
terminal nodes with a weighted l-edge, where the weight is 
an integer. An EVBDD is obtained by recursively applying the 
conversion shown in Fig. 1 to each non-terminal node in an 
MTBDD, where in Fig. 1, dashed lines and solid lines denote 
0-edges and 1 -edges, respectively. Note that, in the EVBDD, 
0-edges (dashed lines) have weight 0, while the incoming edge 
into the root node can have some weight. 

For more detail on these BDDs, refer to [17]. 

Example 1 Fig. 2(b) and (c) show an MTBDD and an EVBDD 
for the integer function f defined by Fig. 2(a). In Fig. 2, dashed 
lines and solid lines denote 0-edges and l-edges, respectively. 
Note that the EVBDD has weighted {-edges. In the MTBDD, 
terminal nodes represent function values. Thus, to evaluate the 
function, we traverse the MTBDD from the root node to a ter¬ 
minal node according to the input values, and obtain the func¬ 
tion value (an integer) from the terminal node. On the other 
hand, in the EVBDD, we obtain the function value by summing 


Input: 

Numerical function f(X), domain [A, B) for A, accuracy 
m,,, of X, polynomial order d, and acceptable approxima¬ 
tion error e fl . 

Output: 

Segments [A,P 0 ), [B),Pi), ..., [P t - 2 ,B). 

Step: 

1 . 

For [A,B), compute the maximum approximation error 

2. 

Zd(A,B). 

If Zd(A,B) < e a or B — A < 2~ m ‘", then stop. 

3. 

Else, partition [A,6) into two segments [A,P) and [P.B), 

4. 

where P = (A + B)/2. 

Repeat Steps 1, 2, and 3 for each new segment recursively, 


until the maximum approximation errors are smaller than 
e a in all segments. 


Fig. 3. Recursive segmentation algorithm for the domain. 

the weights of the edges traversed from the root node to the 
terminal node. (End of Example) 

Ill. Piecewise Polynomial Approximation Based on 
Non-uniform Segmentation 

To approximate the numerical function f(X) using polyno¬ 
mial functions, we first partition the domain for X into seg¬ 
ments. For each segment, we approximate f(X) using a poly¬ 
nomial function specific to that segment. In many cases, the 
domain is partitioned into uniform segments. Such methods are 
useful for elementary functions, such as sin( ttA'), but for some 
numerical functions, such as — (Xlog 2 (A) + (1 — X)log 2 (l — 
A)), too many segments are required, resulting in large mem¬ 
ory. To reduce the number of segments, we use a non-uniform 
segmentation, called recursive segmentation. 

A. Recursive Segmentation Algorithm 

Fig. 3 shows a recursive segmentation algorithm. The in¬ 
puts for this algorithm are a numerical function f(X), a do¬ 
main [A,B) for X, an accuracy m ln of X, a polynomial order 
d, and an acceptable approximation error e«. Then, this algo¬ 
rithm produces t segments [A,Po), [Po,-Pi)> • • •, \Pt- 2 ,B) by re¬ 
cursively partitioning a segment into two equal-sized segments 
until achieving the acceptable approximation error e a in all seg¬ 
ments. Note that this algorithm restricts the width w, of each 
segment to Wj = 2 h ‘ x 2~ m ‘ n , where hj is an integer. That is, the 
segmentation points P, are restricted to values of which the least 
significant h, bits are 0 (i.e., Pj = (... p~j +1 p~j 00 ... 0 ) 2 , 
where j = m, n — hi). As shown in Fig. 3, the number of seg¬ 
ments depends on the maximum approximation error Zj(A,B). 
In this paper, we use the Chebyshev approximation polynomi¬ 
als. For a segment [S,E] of f(X), the maximum approxima¬ 
tion error of the c/th-order Chebyshev approximation 8 d{S,E) 
is given by [12]: 

where fO+A is the (d + l)th-order derivative of /. 

B. Computation of the Approximate Value 

For each segment, f(X) is approximated by the corre¬ 
sponding polynomial function g(X,i). That is, the approx¬ 
imated value of f(X) is computed by g(X,i) = Cd(i)X d + 
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x 



f(X) 


X — ( ... X—j X—j— i ... x i njr! ) 2 

— Si =( ...s-j 0 ... 0 ) 2 

( ... 0 x-j- 1 ... x- mtn ) 2 

X_=( ...X-j X-j-l ... X-m in )2 

&5,=( ...fry l ... l ) 2 

( ... 0 X_ y -_ i ... X- min ) 2 

Note: j = w m - /t,-. 


(a) Architecture for NFG. (b) Computation of X — Si using AND gates. 



(b) LUT cascade 
(MTJSIE). 


(c) LUT cascade and 
adders (EV_SIE). 


Fig. 5. Segment index encoders. 


Fig. 4. Architecture for the NFG based on 2nd-order polynomial. 


Cd-i(i)X d ~ l +... + Co(i), where i is a segment index assigned 
to each segment, and the coefficients (*), i (/), - - -, Co (*) 
are derived from the t/th-order Chebyshev approximation poly¬ 
nomial [12], 

For each segment [■S',, Ef), substituting X .S', + Sj for X yields 
the transformation 

g(X,i) = 

Q (i) (X - Si) d + C d _ x (i) (X - Si ) d - 1 +... + C' 0 (i), (1) 

where 

Cj(i) = I ( J + k )c j+k (i)S^ [j = 0,l,...,d — 1). 

A=0 V J / 

This transformation reduces the multiplier size (see Section IV- 
B). 


IV. Architecture for NFG 

Fig. 4 shows the architecture for the NFG based on a 2nd- 
order polynomial. As shown in Fig. 4(a), polynomials of the 
form (1) are realized using a segment index encoder ( SIE ), a 
coefficients table, circuits for (X — Sf) k (k = d,d— 1,...,2), 
multipliers, and adders. This architecture can realize any non- 
uniform segmentation. However, when recursive segmentation 
is used, we can realize X — Sj using 2-input AND gates instead 
of an adder. As mentioned in the previous section, the least sig¬ 
nificant hj bits of Si are 0, and X — Sj < 2 h ‘ x 2~ m ‘ n . Therefore, 
X — Sj has 1 ’s only in the least significant hj bits, and these 1 ’s 
occur in exactly the same position as the l’s in X. Thus, as 
shown in Fig. 4(b), we realize X — Sj using AND gates driven 
on one side by Sj, the complement of Sj. The SIE converts X 
into a segment index i. It realizes the segment index function 

seg-func(X) : {0,1}" —> {0,1 ,t — 1} shown in Fig. 5(a), 

where X has n bits, and t denotes the number of segments. 

A. Architecture of SIE 

Fig. 5(b) shows an LUT cascade [20] that realizes 
seg-func(X). The LUT cascade is obtained by functional de¬ 
compositions using an MTBDD for seg-func(X) [18, 19], and 


can realize any seg-func(X), where the size depends on the 
number of segments. [15] has shown that the size of an LUT 
cascade can be reduced by reducing the number of segments. 
This section presents a new architecture for the SIE to reduce 
the size and delay time further. Fig. 5(c) shows the new archi¬ 
tecture. To realize seg-func(X) using the SIE in Fig. 5(c), we 
represent seg-func(X) using an EVBDD. And then, by decom¬ 
posing the EVBDD, we obtain the SIE that consists of an LUT 
cascade and adders. In an LUT cascade, the interconnecting 
lines between adjacent LUTs are called rails. In this case, the 
rails represent sub-functions in the EVBDD. And, the outputs 
from each LUT other than rails represent the sum of weights of 
edges. In this paper, we call such outputs Arails (adder rails). 
To the best of our knowledge, this is the first design method 
using an EVBDD to produce the cascaded architecture. 

Example 2 By decomposing the MTBDD and EVBDD in 
Fig. 2, we obtain the SIEs in Fig. 6. Fig. 6(a) and (b) illustrate 
the correspondences between each LUT and decompositions of 
the MTBDD and the EVBDD. respectively. In these figures, 
the column labeled as V, ’ in the table of each LUT denotes the 
rails that represent sub-functions in BDDs. And, the column 
‘af in Fig. 6(b) denotes the Arails that represent the sum of 
weights of edges. In the MTBDD. numbers assigned to edges 
that cut across the horizontal lines represent sub-functions. In 
the EVBDD, “(aj.rf)" assigned to edges that cut across the 
horizontal lines represent the sum of weights and sub-functions, 
respectively. The SIE in Fig. 6(a) requires a memory size of 
2 2 x 2 + 2^ x 3 + 2 4 x 3 = 80 bits and 3 levels (3 LUTs). On 
the other hand, the SIE in Fig. 6(b) requires a memory size of 
2 2 x 4 + 2 2 x 2 + 2 2 x 1 = 28 bits and 4 levels (3 LUTs + 1 
adder). (End of Example) 

This paper uses two terms: MTJSIE and EVA IE denote the 
SIEs designed using an MTBDD (Fig. 5(b)) and EVBDD 
(Fig. 5(c)), respectively. Both the MT_SIE and the EV_SIE can 
realize any non-uniform segmentation. In both cases, memory 
size depends on the number of segments. Specifically, 

Theorem 1 Let seg_func(X) be a segment index function with 
t segments. Then, there exists an EVAIE for seg_f unc(X) with 
at most \ log 2 f] rails and [~log 2 1] Arails. 

The proof is omitted because of the page limitation. 

The memory size and the number of levels of an EV_SIE 
depend on the decomposition of an EVBDD. To obtain the op¬ 
timum decomposition, we use optimization algorithms for het¬ 
erogeneous multi-valued decision diagrams (MDDs) [14]. 
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(a) SIE using MTBDD (MT.SIE). 



(b) SIE using EVBDD (EV.SIE). 


Fig. 6. Example of SIEs. 


B. Reduction of the Size of the Multiplier 

Since large multipliers have large delay, it is important to 
reduce multiplier size. We do this in two ways; Reduce the 
number of bits needed to represent 1. the coefficients and 2. 
the variables (X — Si). 

To reduce the number of bits in the coefficients, we use a 
scaling method [10]. We first shift right the coefficients. Then, 
we apply rounding. Then, we do the actual multiplication. 
And, finally, we shift left the product to compensate for the 
original shift right of the coefficients. This process is similar 
to floating point multiplication. A side effect is that rounding 
error is increased, since rounding occurs on a smaller value. 
In applying this method, we choose the largest exponent (right 
shift) that produces an error no greater than the given accept¬ 
able error [15]. If this yields an exponent of 0 (no right shift), 
in all segments, then we do not use the scaling method. 

To reduce the value of the variable X — S t , we make the fol¬ 
lowing observation. In each segment [£,-,£)), we have X — Si < 
Ej — Si. Thus, reducing the segment width reduces X — Si for X 
near £). However, this also increases the number of segments, 
and thus the memory size. We show a segment reduction tech¬ 
nique that does not increase memory size. 

In an FPGA implementation, the coefficients table in Fig. 4 
has 2" words, where u = [log 2 t] and t is the number of seg¬ 
ments. Therefore, we can increase the number of segments up 
to t = 2" without increasing the memory size. From Theorem 1, 
the size of the EV_SIE also depends on the value of u. Increas¬ 
ing the number of segments to t = 2" rarely increases the size 


TABLE I 

Number of segments for various segmentation methods. 


X has 23-bit accuracy. 

Acceptable approximation error: 2~ 25 

Function 

n >o 

Domain 

[A, B) 

No. of 
uniform 
sees 

No. of 
nonuni. 
segs 

Recursive 

No. of 
seas 1 

No. of 
seas 2 

Time 

[msec.l 

e x 

sin(7tX) 

tan(7tX) 

arcsiNX) 

Vx 

V^Hx) 

Xln(X) 

[0,1) 

[0,0.5) 

0,0.5) 

[0,1) 

(0,1) 

(0,1) 

(0,1) 

128 

128 

4,194,304 

8,388,608 

8,388,607 

8,388,607 

2,097,152 

67 

74 

4,594 

256 

228 

698 

172 

103 

112 

5,723 

363 

322 

967 

250 

128* 

128* 

8.192 

512 

512 

1,024 

256 

10 

10 

1,600 

70 

30 

190 

10 


*Uniform segmentation is produced. 

Environment: Sub Blade 2500 (Silver), UltraSPARC-IIIi 1.6GHz, 
6GB memory, Solaris 9. 


of the EV_SIE. We reduce the size of segments by dividing the 
largest segment into two equal sized segments up to t = 2". 


V. Experimental Results 
A. Number of Segments and Computation Time 

Table I compares the number of segments for various seg¬ 
mentation methods based on 2nd-order Chebyshev approxima¬ 
tion. In Table I, “No. of uniform segs” shows the number 
of uniform segments, “No. of nonuni. segs” shows the num¬ 
ber of non-uniform segments produced by [15], and “Recur¬ 
sive” denotes the recursive segmentation method shown in this 
paper. In the column “Recursive”, the sub-column “No. of 
segs 1” shows the number of segments produced by the seg¬ 
mentation algorithm shown in Section III. The sub-column 
“No. of segs 2” shows the number of segments produced by 
additionally applying the reduction method of multiplier size 
shown in Section IV. The sub-column “Time” shows the total 
CPU time, in milliseconds, for both the segmentation algorithm 
and the reduction method of multiplier size. 

Table I shows that uniform segmentation requires exces¬ 
sively many segments to approximate certain functions, such 
as tan(jtV). Many existing NFGs are based on uniform seg¬ 
mentation, and have not realized tan(jtX) in domain [0,0.5). 
tan(7tX) in [0,0.5) can be computed by sin(7tX)/cos(7tX) or 
a combination of tan(7tX) in [0,0.25] and l/tan(7tX'), where 
X' = 0.5 — X. However, these require multiple NFGs for ele¬ 
mentary functions, such as sin, cos, or the reciprocal function. 
On the other hand, methods based on non-uniform or recur¬ 
sive segmentation can compactly realize tan(7tX) with a sin¬ 
gle NFG, since non-uniform and recursive segmentation meth¬ 
ods require many fewer segments. For all functions in Table I, 
the non-uniform segmentation method [15] requires the fewest 
segments among the three segmentation methods. Although 
our recursive segmentation algorithm restricts the segmenta¬ 
tion points, it requires only up to 2.2 times more segments than 
non-uniform segmentation [15]. That is, our recursive segmen¬ 
tation algorithm generates a segmentation appropriate to the 
given function, while restricting the segmentation points. For 
example, for e x and sin(jtX'), our algorithm generates uniform 
segmentation. As shown in [15, 21], uniform segmentation is 
appropriate for these functions. 

These results show that our recursive segmentation algo¬ 
rithm generates a non-uniform segmentation appropriate to the 
given functions quickly. 
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TABLE II 

FPGA IMPLEMENTATION OF SIES. 


FPGA device: Altera Stratix EP1S10F484C5 (LE: 10,570, M4K: 60, M512: 90) 

Logic synthesis tool: Altera QuartusII 5.0 (speed optimization, timing requirement of 200MHz) 


Function 

m 


Optimum 

non-uniform segmentation 


Recursive segmentation 

MT.SIE 

EV.SIE 

MT_SIE 

EV.SIE 

Memory 

Ibitsl 

LE 

Level 

Delay 

rnsec.l 

Memory 

Ibitsl 

LE 

Level 

Delay 
Insec. 1 

Memory 

Ibitsl 

LE 

Level 

Delay 
Insec.1 

Memory 

Ibitsl 

LE 

Level 

Delay 

rnsec.l 

(A 

26,368 

73 

8 

27.5 

23,040 

119 

7 

24.1 

0 

0 

0 

0 

0 

0 

0 

0 

sinOtX) 

26,880 

71 

8 

27.5 

23,552 

114 

7 

24.1 

0 

0 

0 

0 

0 

0 

0 

0 

tan(7tX) 

1,802,240 

— 

5 

- 

179,968 

207 

10 

36.3 

1,687,552 

- 

5 

- 

15,108 

142 

7 

24.1 

arcsin(X) 

61,440 

72 

8 

27.5 

53,824 

217 

13 

44.7 

49,152 

72 

7 

24.3 

9,984 

88 

5 

17.2 

Vx 

61,440 

72 

8 

27.5 

57,408 

183 

11 

40.3 

44,544 

71 

7 

24.1 

9,216 

91 

5 

17.2 

V~Hx) 

266,240 

81 

7 

33.2 

116,160 

204 

11 

40.2 

172,032 

71 

7 

28.1 

12,736 

109 

6 

20.6 

Xln(X) 

61,440 

79 

8 

27.5 

48,384 

160 

9 

30.9 

20,992 

49 

5 

17.2 

6,912 

65 

4 

13.8 


-: It cannot be mapped into the FPGA due to insufficient RAM blocks. 


B. FPGA Implementation of SIEs 

Table II shows that the recursive segmentation algorithm also 
automatically generates uniform segmentation when appropri¬ 
ate. Table II compares the FPGA implementation results of the 
MT_SIE and EV_SIE. Note that the memory size in bits and 
LE, the number of logic elements, are 0 for e x and sin(7tX) 
when recursive segmentation is used. This indicates that uni¬ 
form segmentation was applied, and so an SIE was not needed. 
The result is a faster NFG for these functions. In the experiment 
that produced the data in Table II, we optimized the decompo¬ 
sition of the MTBDDs and EVBDDs by requiring the memory 
size of each LUT in the LUT cascade in these SIEs to be 4K 
bits, the same as the RAM block (M4K) of the FPGA. 

Table II shows that for optimum non-uniform segmentation, 
the EV_SIEs have smaller memory size than the MT_SIEs. For 
example, for tan(7tX), the memory size of the EV_SIE is only 
10% of the memory size needed by the MT_SIE. For tan(7tX), 
the memory size of MT_SIE is quite large because the number 
of non-uniform segments is large. From experiments with uni¬ 
form segmentation we know that the NFG for tan(7tX) using 
the MT_SIE requires only 1.5% of the memory size needed by 
the NFG based on uniform segmentation. However, this is still 
too large to implement with an FPGA. By using the EV_SIE, 
we can reduce the memory size significantly, and make the 
NFG implementable with an FPGA. 

Our recursive segmentation can reduce both the memory size 
and the delay time of the MT_SIEs. Especially, forXln(X), us¬ 
ing an MT_SIE designed for recursive segmentation has only 
34% of the memory and 63% of the delay of the MT_SIE de¬ 
signed for optimum non-uniform segmentation. 

By using recursive segmentation and the EV_SIE, we can re¬ 
duce both memory size and delay time of SIEs significantly. 
For all functions in Table II, both memory size and delay time 
of the EV_SIEs for recursive segmentation are much smaller for 
the MT_SIEs. In terms of the number of LEs, the EV_SIEs re¬ 
quire only up to 1.5 times more LEs than the MT_SIEs. There¬ 
fore, designing an EV_SIE for recursive segmentation yields 
faster and more compact SIEs than obtained by previous meth¬ 
ods. The design is formal and is easily programmed. 

C. FPGA Implementation ofNFGs 

Table III compares the FPGA implementation results of our 
NFGs using EV_SIE (EVNFGs) with the existing NFGs us¬ 
ing MT_SIE (MTNFGs) [15], where EVNFGs are based on re¬ 
cursive segmentation and MTNFGs are based on the optimum 
non-uniform segmentation. Both NFGs have 23-bit precision 
(23-bit accuracy). 


From Table II and Table III, we can see that the memory 
size of MT_SIE accounts for more than 2/3 of the total mem¬ 
ory size of the MTNFG. On the other hand, by using recursive 
segmentation and EV_SIE, the memory size needed for the SIE 
can be reduced to less than 1 /4 of the total memory size of 
the EVNFG. Thereby, the EVNFGs require only 21% to 63% 
of memory size needed for the MTNFGs. For arcsin(X) and 
Vx, as shown in Table I, our recursive segmentation requires 
a coefficients table that is about twice as large as needed for 
the optimum non-uniform segmentation. Nevertheless, by us¬ 
ing EV_SIEs, the memory sizes of EVNFGs can be reduced 
to about 63% of the memory sizes of MTNFGs. Further, Ta¬ 
ble III shows that the EVNFGs require fewer LEs and levels 
(i.e., shorter latency) than the MTNFGs, and the delay time of 
EVNFGs is only about 25% to 89% of the delay time of the 
MTNFGs. 

To compare our NFG with another existing NFG based on 
a segmentation approach (hierarchical segmentation) shown in 
[11], we implemented our 24-bit precision NFG for Xln(X) 
using the Xilinx Virtex-II FPGA (XC2V4000-6) and the Syn- 
plify Premier 8.5. Memory size and delay time of the NFG 
described in [11] are 40,446 bits and 103.7 nsec.. On the other 
hand, memory size and delay time of our EVNFG based on 
recursive segmentation are 30,976 bits (77%) and 63.3 nsec. 
(61%). 

From these results, we can see that our NFGs using recur¬ 
sive segmentation and the EV_SIE can realize a wide range of 
functions faster and more compactly than existing NFGs. 

VI. Conclusion and Comments 

We have presented an architecture and a synthesis method 
for fast and compact NFGs for trigonometric, logarithmic, 
square root, reciprocal, and combinations of these functions. 
Our NFG partitions a given domain of the function into non- 
uniform segments using recursive segmentation, and approx¬ 
imates the given function by a polynomial function for each 
segment. By using an EVBDD to realize the recursive segmen¬ 
tation, we can implement fast and compact NFGs for a wide 
range of functions. Experimental results showed that: 1) By 
using recursive segmentation, we reduced memory size and de¬ 
lay time needed for the MT_SIE, and produced MT_SIEs that 
have, on average, only 49% of the memory and 53% of the 
delay of MT_SIEs for the optimum non-uniform segmentation. 
2) By using EVBDD to realize recursive segmentation, we fur¬ 
ther reduced memory size and delay time needed for the SIE. 
Our SIEs using the EVBDDs require, on average, only 7% of 
the memory and 40% of the delay of MT_SIEs for the opti- 
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TABLE III 

FPGA IMPLEMENTATION OF 23-BIT PRECISION (23-BIT ACCURACY) NFGS. 


FPGA device: 

Logic synthesis tool: 

Altera Stratix EP1S60F1020C5 

(LE: 57,120, DSP: 144, M4K: 292, M512: 574) 

Altera QuartusII 5.0 

(speed optimization, timing reauirement of 200MHz) 


Function 

f(X) 

MTNFG based on optimum nonuni. 

EVNFG based on recursive 

Memory 

Tbitsl 

LE 

DSP 

Level 

Delay 

lnsec.1 

Memory 

Ibitsl 

LE 

DSP 

Level 

Delay 

lnsec.1 

e x 

39,040 

689 

10 

13 

99.6 

8,064 

432 

10 

3 

25.1 

sin(jtXl 

36,864 

635 

10 

13 

99.1 

7,936 

395 

10 

3 

28.3 

tan(7tX) 

2,867,200 

— 

16 

11 

- 

973,572 

1,059 

16 

12 

92.3 

arcsin(X) 

84,736 

1,301 

16 

14 

107.3 

53,504 

937 

16 

10 

80.3 

Vx 

83,712 

1,041 

16 

14 

116.5 

52,224 

962 

16 

10 

85.5 

V-Hx) 

357,376 

950 

16 

13 

99.8 

103,872 

972 

16 

11 

88.3 

Xln(X) 

83,200 

988 

16 

14 

116.0 

29,696 

997 

16 

9 

70.3 


It cannot be mapped into the FPGA due to insufficient RAM blocks. 


mum non-uniform segmentation. And therefore, 3) our NFGs 
require, on average, only 38% of the memory and 59% of the 
delay needed by the existing NFGs based on MT_SIE and the 
optimum non-uniform segmentation. 


Acknowledgments 

This research is partly supported by the Grant in Aid for 
Scientific Research of the Japan Society for the Promotion of 
Science (JSPS), funds from Ministry of Education, Culture, 
Sports, Science, and Technology (MEXT) via Kitakyushu in¬ 
novative cluster project, a contract with the National Secu¬ 
rity Agency, the MEXT Grant-in-Aid for Young Scientists (B), 
18700048,2006, and Hiroshima City University Grant for Spe¬ 
cial Academic Research (General Studies), 6101, 2006. 


References 

[1] R. E. Bryant, “Graph-based algorithms for boolean function manipula¬ 
tion,” IEEE Trans. Comput., Vol. C-35, No. 8, pp. 677-691, Aug. 1986. 

[2] A. Cantoni, “Optimal curve fitting with piecewise linear functions,” 
IEEE Trans, on Comp., Vol. 20, No. 1, pp. 59-67, Jan. 1971. 

[3] J. Cao, B. W. Y. Wei, and J. Cheng, “High-performance architectures 
for elementary function generation,” Proc. of the 15th IEEE Symp. on 
Computer Arithmetic (ARITH’01), Vail, Colorado, pp. 136-144, June 
2001. 

[4] E. M. Clarke, K. L. McMillan, X. Zhao, M. Fujita, and J. Yang, “Spec¬ 
tral transforms for large Boolean functions with applications to tech¬ 
nology mapping,” Proc. of 30th ACM/IEEE Design Automation Confer¬ 
ence, pp. 54-60, June 1993. 

[5] D. Defour, F. de Dinechin, and J.-M. Muller, “A new scheme for table- 
based evaluation of functions,” 36th Asilomar Conference on Signals, 
Systems, and Computers,, Pacific Grove, California, pp. 1608-1613, 
Nov. 2002. 

[6] J. Detrey and F. de Dinechin, “Table-based polynomials for fast hard¬ 
ware function evaluation,” 16th IEEE Inter. Conf. on Application- 
Specific Systems, Architectures, and Processors (ASAP'05), pp. 328- 
333, 2005. 

[7] C. L. Frenzen, T. Sasao, and J. T. Butler, “The tradeoff between memory 
size and approximation error in numerical function generators based on 
lookup tables,” preprint. 

[8] V. K. Jain, S. A. Wadekar, and L. Lin, “A universal nonlinear component 
and its application to WSI,” IEEE Trans, on Components, Hybrids, and 
Manufacturing Technology, Vol. 16, No. 7, pp. 656-664, Nov. 1993. 

[9] Y-T. Lai, M. Pedram, and S. B. Vrudhula, “EVBDD-based algorithms 
for linear integer programming, spectral transformation and functional 


decomposition,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 
Vol. 13, No. 8, pp. 959-975, Aug. 1994. 

[10] D.-U. Lee, W. Luk, J. Villasenor, and P. Y. K. Cheung, “Non-uniform 
segmentation for hardware function evaluation,” Proc. Inter. Conf. on 
Field Programmable Logic and Applications, pp. 796-807, Lisbon, Por¬ 
tugal, Sept. 2003. 

[11] D.-U. Lee, W. Luk, J. Villasenor, and P. Y. K. Cheung, “Hierarchi¬ 
cal segmentation schemes for function evaluation,” Proc. of the IEEE 
Conf. on Field-Programmable Technology, Tokyo, Japan, pp. 92-99, 
Dec. 2003. 

[12] J. H. Mathews, Numerical Methods for Computer Science, Engineering 
and Mathematics, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1987. 

[13] J.-M. Muller, Elementary Function: Algorithms and Implementation, 
Birkhauser Boston, Inc., Secaucus, NJ, 1997. 

[14] S. Nagayama and T. Sasao, “On the optimization of heterogeneous 
MDDs,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., Vol. 24, 
No. 11, pp. 1645-1659, Nov. 2005. 

[15] S. Nagayama, T. Sasao, and J. T. Butler, “Programmable numerical 
function generators based on quadratic approximation: architecture and 
synthesis method,” Proc. of Asia and South Pacific Design Automation 
Conference (ASPDAC’06), Yokohama, Japan, pp. 378-383, 2006. 

[16] J.-A. Pineiro, S. F. Oberman, J.-M. Muller, and J. D. Bruguera, “High¬ 
speed function approximation using a minimax quadratic interpolator,” 
IEEE Trans, on Comp., Vol. 54, No. 3, pp. 304-318, Mar. 2005. 

[17] T. Sasao and M. Fujita (eds.), Representations of Discrete Functions, 
Kluwer Academic Publishers 1996. 

[18] T. Sasao, M. Matsuura, and Y. Iguchi, “A cascade realization of 
multiple-output function for reconfigurable hardware,” Inter. Workshop 
on Logic Synthesis (IWLSVl), Lake Tahoe, CA, pp. 225-230, June 12- 
15, 2001. 

[19] T. Sasao and M. Matsuura, “A method to decompose multiple-output 
logic functions,” 41st Design Automation Conference, San Diego, CA, 
pp. 428^-33, June 2-6, 2004. 

[20] T. Sasao, J. T. Butler, and M. D. Riedel, “Application of LUT cas¬ 
cades to numerical function generators,” Proc. the 12th workshop on 
Synthesis And System Integration of Mixed Information technologies 
(SASIMP04), Kanazawa, Japan, pp. 422^-29, Oct. 2004. 

[21] T. Sasao, S. Nagayama, and J. T. Butler, “Programmable numerical 
function generators: architectures and synthesis method,” Proc. Inter. 
Conf. on Field Programmable Logic and Applications (FPL’05), Tam¬ 
pere, Finland, pp. 118-123, Aug. 2005. 

[22] M. J. Schulte and E. E. Swartzlamder, “Hardware designs for exactly 
rounded elementary functions,” IEEE Trans, on Comp., Vol. 43, No. 8, 
pp. 964—973, Aug. 1994. 

[23] J. E. Stine and M. J. Schulte, “The symmetric table addition method 
for accurate function approximation,” Jour, of VLSI Signal Processing, 
Vol. 21, No. 2, pp. 167-177, June 1999. 


540 




